Abstract

We analyse the average-case cache performance of distribution sorting algorithms in the
case when keys are independently but not necessarily uniformly distributed. The analysis
covers both 'in-place' and 'out-of-place' distribution sorting algorithms and is more accurate
than the analysis presented in [13]. In particular, this new analysis yields tighter upper and
lower bounds when the keys are drawn from a uniform distribution. We use this analysis to
tune the performance of the integer sorting algorithm MSB radix sort when it is used to sort
independent uniform floating-point numbers (floats). Our tuned MSB radix sort algorithm
comfortably outperforms cache-tuned implementations of bucket sort [11] and Quicksort when
sorting uniform floats from [0, 1).

1 Introduction 

Distribution sorting is a popular alternative to comparison-based sorting which involves placing
n input keys into k < n classes based on their value [5]. The classes are chosen so that all the
keys in the i-th class are smaller than all the keys in the (i + 1)-st class, for i = 1, . . . , k - 1, and
furthermore, the class to which a key belongs can be computed in O(1) time (e.g. if the keys
are floats in the range [a, b), we can calculate the class of a key $x_i$ as $1 + \lfloor k (x_i - a)/(b - a) \rfloor$). Thus, the
original sorting problem is reduced in linear time to the problem of sorting the keys in each class.
A number of distribution sorting algorithms have been developed which run in linear (expected)
time under some assumptions about the input keys, such as bucket sort and radix sort. Due to
their poor cache utilisation, even good implementations of these 'linear-time' algorithms, which
minimise instruction counts, fail to outperform general-purpose O(n log n)-time algorithms such
as Quicksort or Mergesort on modern computers [8, 11].
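As an illustration of the O(1) class computation for floats in [a, b), a minimal sketch follows. The function name and argument order are our own choices for this example, and the clamp to class k is an added guard against floating-point rounding rather than part of the formula.

```python
import math

def classify(x, a, b, k):
    """Return the 1-based class of key x in [a, b) when [a, b) is split into k equal classes."""
    if not (a <= x < b):
        raise ValueError("key outside [a, b)")
    # min() guards against rounding pushing a key just below b into class k + 1.
    return min(k, 1 + math.floor(k * (x - a) / (b - a)))
```

With k = 4 and keys in [0, 1), the keys 0.0, 0.25, 0.5 and 0.999 fall into classes 1, 2, 3 and 4 respectively.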

Most algorithms are based upon the random-access machine model [1], which assumes that
main memory is as fast as the CPU. However, in modern computers, main memory is typically one
or two orders of magnitude slower than the CPU [2]. To mitigate this, one or more levels of cache
are introduced between CPU and memory. A cache is a fast associative memory which holds the
values of some main memory locations. If the CPU requests the contents of a memory location, and
the value of that location is held in some level of cache (a cache hit), the CPU's request is answered
by the cache itself in typically 1-3 clock cycles; otherwise (a cache miss) it is answered by accessing
main memory in typically 30-100 clock cycles. Since typical programs exhibit locality of reference
[1], caches are often effective. However, algorithms such as distribution sort have poor locality of
reference, and their performance can be greatly improved by optimising their cache behaviour. A
number of papers have recently addressed this issue [3 [HI El HH M El > mostly in the context of 
sorting and related problems. There is also a large literature on algorithms specifically designed 
for hierarchical models of memory '15' "2] , but there are some important differences between these 
models and ours (see [10] for a summary).

The cache performance of comparison-based sorting algorithms was studied in [51 IHl [15 ^.nd 
distribution sorting algorithms were considered in [7, 8, 11]. One pass of a distribution sort consists
of a count phase where the number of keys in each class is determined, followed by a permute
phase where the keys belonging to the same class are moved to consecutive locations in an array. We
give an analysis of the cache behaviour of the permute phase, assuming the keys are independently
drawn from a non-uniform distribution. In [13] we focused on 'in-place' permute, where the keys
are rearranged without placing them first in an auxiliary array. In this paper we extend the analysis
to 'out-of-place' permute. We model the above algorithms as probabilistic processes, and analyse
the cache behaviour of these processes. For each process we give an exact expression for, as well
as matching closed-form upper and lower bounds on, the number of misses.

In previous work on the cache analysis of distribution sorting, [7] analysed the (somewhat
easier) count phase for non-uniform keys, and [11] gave an empirical analysis of the permute phase
for uniform keys. The process of accessing multiple sequences of memory locations, which arises in
multi-way merge sort, was analysed previously by [9, 14]. The analysis in [9] assumes that accesses
to the sequences are controlled by an adversary; our analysis demonstrates, among other things,
that with uniform randomised accesses to the sequences, more sequences can be accessed optimally.
In [14] a lower bound on cache misses is given for uniform randomised accesses; our lower bound
is somewhat sharper. The analysis also improves upon the results in [13], by giving tighter upper
and lower bounds when the keys are drawn from a uniform distribution.

In practice there are often cases when keys are not uniform (e.g., they may be normally
distributed); our analysis can be used to tune distribution sort in these cases. We consider a different
application here: sorting uniform floats using an integer sorting algorithm. It is well known that
one can sort floats by sorting the bit-strings representing the floats, interpreting them as integers
[4]. Since (simple) operations on integers are faster than operations on floats, this can improve
performance; indeed, in [11] it was observed that an ad hoc implementation of the integer sorting
algorithm most-significant-bit first radix sort (MSB radix sort) outperformed an optimised version
of bucket sort on uniform floats. We observe that a uniform distribution on floating-point numbers
induces a non-uniform distribution on the representing integers, and use our cache analysis to
improve the performance of MSB radix sort on our machine. Our tuned 'in-place' MSB radix sort
comfortably outperforms optimised implementations of other in-place or 'in-place' algorithms such
as Quicksort or MPFlashsort [11], which is a cache-tuned version of bucket sort.
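The idea of sorting floats through their bit-strings can be sketched as follows. This is only an illustration under our own assumptions: keys are non-negative IEEE 754 single-precision values, the helper names are hypothetical, and Python's built-in sorted() stands in for an integer sorting algorithm such as MSB radix sort.

```python
import struct

def float_bits(x):
    """Bit pattern of a non-negative IEEE 754 single-precision float, as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def sort_nonneg_floats_via_ints(xs):
    """Sort non-negative floats by sorting their integer bit patterns."""
    # sorted() is a stand-in for any integer sorting algorithm.
    keys = sorted(float_bits(x) for x in xs)
    return [struct.unpack("<f", struct.pack("<I", b))[0] for b in keys]
```

The non-uniformity mentioned above is visible directly in the bit patterns: the floats in [0.5, 1), which carry half of the probability mass of a uniform distribution on [0, 1), occupy only the $2^{23}$ consecutive integers starting at 0x3F000000.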

2 Cache preliminaries 

This section introduces some terminology and notation regarding caches. The size of the cache
is normally expressed in terms of two parameters, the block size (B) and the number of cache
blocks (C). We consider main memory as being divided into equal-sized blocks consisting of B
consecutively-numbered memory locations, with blocks starting at locations which are multiples
of B. The cache is also divided into blocks of size B; one cache block can hold the value of exactly
one memory block. Data is moved to and from main memory only as blocks.

In a direct-mapped cache, the value of memory location x can only be stored in cache block
c = (x div B) mod C. If the CPU accesses location x and cache block c holds the values from
x's block the access is a cache hit; otherwise it is a cache miss and the contents of the block
containing x are copied into cache block c, evicting the current contents of cache block c. For our
purposes, cache misses can be classified into compulsory misses, which occur when a memory block
is accessed for the first time, capacity misses, which occur on an access to a memory block that
was previously evicted because the cache could not hold all the blocks being actively accessed, and
conflict misses, which occur on an access to a block that was evicted from cache because another
memory block that mapped to the same cache block was accessed.

Permute phase (out-of-place permutation)
1   for i := 0 to n - 1 do
        key := DATA[i];
        x := classify(key);
        idx := COUNT[x];
        COUNT[x]++;
        DEST[idx] := key;

Figure 1: Permute phase for an 'out-of-place' permutation in a generic distribution sorting
algorithm. DATA holds the input keys. COUNT and DEST are auxiliary arrays.
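The placement rule of a direct-mapped cache described in this section can be sketched in a few lines. The class below is an illustrative model of our own (its name and interface are assumptions); it records only which memory block each cache block holds and counts misses.

```python
class DirectMappedCache:
    """Direct-mapped cache with C blocks of B locations each; counts misses only."""

    def __init__(self, B, C):
        self.B, self.C = B, C
        self.resident = [None] * C   # memory block number held by each cache block
        self.misses = 0

    def access(self, x):
        """Access memory location x; return True on a hit and False on a miss."""
        mem_block = x // self.B      # x div B
        c = mem_block % self.C       # cache block c = (x div B) mod C
        if self.resident[c] == mem_block:
            return True
        self.resident[c] = mem_block # evict the current contents of cache block c
        self.misses += 1
        return False
```

For example, with B = 4 and C = 2, locations 0 and 8 both map to cache block 0, so the access sequence 0, 8, 0 incurs three misses, the last of which is a conflict miss.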

3 Distribution sorting 

As noted in the introduction, a distribution pass has two main phases, a count phase followed by
a permute phase, and our focus here is on the latter.

In what follows, the term data array refers to the array holding the input keys,
and the term count array refers to an auxiliary array used by these algorithms.

For each class i, $1 \le i \le k - 1$, the count phase computes the total number of keys in classes
0, . . . , i - 1. For class i = 0 this cumulative count is 0. Ladner et al. [7] give an analysis of the count
phase of distribution sorting on a direct-mapped cache for uniformly and randomly distributed keys.
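A minimal sketch of the count phase follows, assuming 0-based classes and a caller-supplied classify function (both the function name count_phase and its signature are our own choices for illustration).

```python
def count_phase(data, k, classify):
    """Return COUNT, where COUNT[i] is the number of keys in classes 0, ..., i - 1."""
    count = [0] * k
    for key in data:
        count[classify(key)] += 1
    total = 0
    for i in range(k):               # turn per-class counts into an exclusive prefix sum
        count[i], total = total, total + count[i]
    return count
```

For the keys 0.9, 0.1, 0.5, 0.15, 0.6 with k = 4 and class $\lfloor 4x \rfloor$, the per-class counts are 2, 0, 2, 1 and the cumulative counts are 0, 2, 2, 4.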

There are two main variants of the permute phase. In the first variant, keys are permuted from
the data array to an auxiliary destination array; this is called an out-of-place permutation. In the
second variant, keys in the data array are permuted within the data array; this is called an in-place
permutation.

3.1 Permute phase 

The permute phase uses the cumulative count of keys generated during the count phase to permute
the keys to their respective classes. We now describe the two variants of the permute phase. In
the description below it is assumed that k has been appropriately initialised, and that the function
classify maps a key to a class numbered in {0, . . . , k - 1} in O(1) time.

3.1.1 Out-of-place permute 

During an out-of-place permute, for any class j, unless all elements of that class have already
been moved, COUNT[j] points to the leftmost (lowest-numbered) available location for an element
of class j in an n-element auxiliary array, DEST. Figure 1 shows the pseudo-code for out-of-place
permutation. In Step 1, for each element in DATA: we determine its class; using the count array
we determine the next available location for this key in the DEST array; we increment the count
array, thus setting the location for the next key of the same class; finally we move the key to its
location in DEST. Since each step takes constant time, this out-of-place permutation takes O(n)
time whenever k < n.
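The out-of-place permute described above can be written directly in Python. This is a sketch under our own naming; count is assumed to be the cumulative count array produced by the count phase and is updated as a side effect.

```python
def out_of_place_permute(data, count, classify):
    """One out-of-place permute phase; count is the cumulative COUNT array (modified)."""
    dest = [None] * len(data)
    for key in data:
        x = classify(key)
        idx = count[x]       # next available location for class x
        count[x] += 1        # reserve the following location for the next key of class x
        dest[idx] = key
    return dest
```

Because keys of the same class are placed in the order in which they are read, this variant is stable.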
Permute phase (in-place permutation)
1   leader := n - 1;
2   idx := leader; key := DATA[idx];
3.1 x := classify(key);
3.2 idx := COUNT[x];
3.3 COUNT[x]++;
3.4 swap key and DATA[idx];
3.5 if idx ≠ leader repeat 3.1;
4   while (x > 0 ∧ COUNT[x - 1] ≥ START[x])
        x--;
5   if (x > 0) leader := START[x] - 1;
        go to 2;

Figure 2: Permute phase for an 'in-place' permutation in a generic distribution sorting algorithm.
DATA holds the input keys. COUNT and START are auxiliary arrays. After the count phase, COUNT is
copied into START.

3.1.2 In-place Permute 

The in-place permutation strategy described here is similar to that described by Knuth [5, Soln.
5.2-13]. Before an in-place permute phase begins, a copy of the count array is made in a k-element
auxiliary start array. During the permute phase, for any class j, an invariant is that locations
START[j], START[j] + 1, . . . , COUNT[j] - 1 contain elements of class j, i.e. COUNT[j] points to the
leftmost (lowest-numbered) available location for an element of class j. Thus, for j = 0, . . . , k - 2,
all elements of class j have been permuted if COUNT[j] ≥ START[j + 1], and such a class will be
called complete in what follows. Class k - 1 is complete when COUNT[k - 1] ≥ n. Figure 2 shows
the pseudo-code for in-place permutation. We now describe this permutation, which consists of
two main activities: cycle following and cycle leader finding. In cycle following, keys are moved to
their final destinations in the data array along a cycle in the permutation (Steps 2 and 3). Once
a cycle is completed, we move to cycle leader finding, where we find the 'leader' (index of the
rightmost element) of the next cycle (Steps 1, 4 and 5). A cycle leader is simply the rightmost
location of the highest-numbered incomplete class. By the definition of a complete class, initially
the leader must be position n - 1. In more detail, the steps are as follows:

• In Step 1, n - 1 is selected as the first cycle leader.

• In Step 2 the key at the leader's position is copied into the variable key, thus leaving a 'hole'
in the leader's position.

• In Steps 3.1-3.5 the key key is swapped with the key at key's final position. If key 'fills the
hole', the cycle is complete, otherwise we repeat these steps.

• In Step 4 the algorithm searches for a new cycle leader. Suppose the leader of the cycle
which just completed was the last location of class j. When this cycle ends, class j must also
be complete, as a key of class j has been moved into the last location of class j. Note that
classes j + 1, j + 2, . . . must already have been complete when the leader of this cycle was
found. Note that the program variable x has value j at the end of this cycle, so the search
for the next leader begins with class j - 1, counting down (Step 4).

• In Step 5 we check to see if all classes have completed and terminate if this is the case.

Clearly the in-place permutation in one pass of distribution sorting takes O(n) time whenever
k < n.
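The five steps above can be sketched in Python as follows. This is our own transcription of the steps for illustration, assuming 0-based classes and a cumulative count array from the count phase; comments refer to the step numbers used in the description.

```python
def in_place_permute(data, count, classify):
    """Permute data in place by class; count is the cumulative COUNT array (modified)."""
    n = len(data)
    if n == 0:
        return
    start = list(count)                  # START is a copy of COUNT
    leader = n - 1                       # Step 1
    while True:
        idx = leader                     # Step 2: lift the key out of the leader's position
        key = data[idx]
        while True:                      # Steps 3.1-3.5: follow the cycle
            x = classify(key)
            idx = count[x]
            count[x] += 1
            key, data[idx] = data[idx], key
            if idx == leader:            # the hole has been filled
                break
        while x > 0 and count[x - 1] >= start[x]:   # Step 4: skip complete classes
            x -= 1
        if x == 0:                       # Step 5: all classes are complete
            return
        leader = start[x] - 1            # last location of the incomplete class x - 1
```

Unlike the out-of-place variant, this permutation needs no destination array, but keys within a class may end up in a different relative order.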
4 Cache analysis

We now analyse cache misses in a direct-mapped cache during the permute phase of distribution
sorting when the keys are independently drawn from a non-uniform random distribution. In the
permute phase of distribution sorting, when a key is moved to its destination, the algorithms
described in Section 3 access any one of k elements in the COUNT array and any one of k locations
in the DATA or DEST arrays, depending on whether the permutation is in-place or out-of-place.
The actual locations accessed depend on the value of the permuted key, so, if the keys
are independently and randomly distributed then, for every key permuted there are two random
accesses to memory, one in the count array and one in DATA or DEST. These random accesses can
potentially lead to a large number of cache conflict misses.

Our approach is to define two continuous processes which model in-place and out-of-place
permutations. Process "in-place" models an in-place permutation and is shown in Figure 3, and
Process "out-of-place" models an out-of-place permutation and is shown in Figure 4. Each round of
a process models the permutation of a key to its destination, and we analyse the expected number
of cache misses in n rounds of these processes. Our precise equations are difficult to compute so
we also give closed-form upper and lower bounds on these precise equations. We use our results
for in-place permutations to get upper and lower bounds on the expected number of cache misses
in a process which models accesses to multiple sequences.

The assumptions in the processes mean that we have to access at least n distinct locations in
memory, which requires Ω(n/B) cache misses. In the analysis, we will say that a process is optimal
if it incurs O(n/B) cache misses. In distribution sorting, the larger the value of k, the fewer the
passes over the data, and hence the fewer the capacity misses. However, as we will see, if k is too
large, then there can be a large number of conflict misses. The aim of the analysis is to determine
the largest value of k, for a particular distribution of keys, such that there are O(n/B) misses in
one pass of distribution sorting.

4.1 Processes 

We now give the two processes which model the distributing of keys drawn independently and 
randomly from a non-uniform distribution into k classes. 

4.1.1 Process to model an in-place permutation 

Let k be an integer, $2 \le k \le CB$. We are given k probabilities $p_1, \ldots, p_k$, such that $\sum_{i=1}^{k} p_i = 1$.
The process maintains k pointers $D_1, \ldots, D_k$, and there are also k consecutive 'count array'
locations, $\mathcal{C} = c_1, \ldots, c_k$. The process (henceforth called Process "in-place") executes a sequence of
rounds, where each round consists of performing steps 1-3 below:

Process "in-place"

1. Pick an integer x from $\{1, \ldots, k\}$ such that $\Pr[x = i] = p_i$,
independently of all previous picks.

2. Access the location $c_x$.

3. Access the location pointed to by $D_x$, increment $D_x$ by 1.

We denote the locations accessed by the pointer $D_i$ by $d_{i,1}, d_{i,2}, \ldots$, for $i = 1, \ldots, k$. We assume
that:

(a) the start position of each pointer is uniformly and independently distributed over the cache,
i.e., for each i, $d_{i,1} \bmod BC$ is uniformly and independently distributed over $\{0, \ldots, BC - 1\}$,

(b) during the process, the pointers traverse sequences of memory locations which are disjoint
from each other and from $\mathcal{C}$,
Figure 3: Process "in-place". In each round, (1) x is selected at random from 1, . . . , k;
(2) $c_x$ is accessed; (3) the location at $D_x$ is accessed and $D_x$ is incremented.


(c) $c_1$ is located on an aligned block boundary, i.e., $c_1 \bmod B = 0$,

(d) the pointers $D_i$, for $i = 1, \ldots, k$, are in separate memory blocks.

Assuming that the cache is initially empty, the objective is to determine the expected number
of cache misses incurred by the above process over n rounds, with the expectation taken over the
random choices in Step 1 as well as the starting positions of the pointers.
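Process "in-place" is simple enough to simulate, which gives an empirical check on the analysis that follows. The sketch below is our own illustration: the function name, the seeding, and the particular memory layout (count array at location 0, each pointer in its own widely separated region) are assumptions chosen to satisfy (a)-(d), and classes are numbered from 0 in the code.

```python
import random

def simulate_in_place(p, B, C, n, seed=0):
    """Count the misses in n rounds of Process "in-place" on a direct-mapped cache."""
    rng = random.Random(seed)
    num_classes = len(p)
    resident = [None] * C                    # memory block held by each cache block
    misses = 0

    def access(loc):
        nonlocal misses
        block = loc // B
        c = block % C
        if resident[c] != block:
            resident[c] = block
            misses += 1

    span = B * C
    region = span * (n // span + 2)          # a multiple of BC, longer than any sequence
    # (c): the count array starts at location 0, a block boundary.
    # (a), (b), (d): pointer i starts at a uniformly random offset modulo BC
    # inside its own region, so the sequences never overlap each other or the count array.
    D = [(i + 1) * region + rng.randrange(span) for i in range(num_classes)]
    for x in rng.choices(range(num_classes), weights=p, k=n):   # Step 1
        access(x)                            # Step 2: access c_x
        access(D[x])                         # Step 3: access the location at D_x ...
        D[x] += 1                            # ... and increment D_x
    return misses
```

Averaging the returned count over many seeds gives an estimate of the expectation studied below; a single run always lies between n/B (rounded down) and 2n.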

4.1.2 Process to model an out-of-place permutation 

This process is like Process "in-place", but it is augmented with accesses to a sequence of
consecutive locations in a source array, S, determined by an index s. The process, henceforth called
Process "out-of-place", executes a sequence of rounds, where each round consists of performing
steps 1-4 below:

Process "out-of-place"

1. Access the location S[s], increment s by 1.

2. Pick an integer x from $\{1, \ldots, k\}$ such that $\Pr[x = i] = p_i$,
independently of all previous picks.

3. Access the location $c_x$.

4. Access the location pointed to by $D_x$, increment $D_x$ by 1.

We make assumptions (a), (c), and (d) from Process "in-place"; assumption (b) is modified as
below and we add a further assumption:

(b) during the process, the pointers traverse sequences of memory locations which are disjoint
from each other, from $\mathcal{C}$ and from S.

Assuming that the cache is initially empty, again the objective is to determine the expected
number of cache misses incurred by the above process over n rounds, with the expectation taken
over the random choices in Step 2 as well as the starting positions of the pointers.

4.2 Preliminaries 

We now introduce some notation that will be used for the analysis. We use k to denote the number
of classes that the keys will be distributed into, and throughout the analysis we assume that B
divides k. Assume that we are given a set of k probabilities $p_1, \ldots, p_k$, such that $\sum_{i=1}^{k} p_i = 1$.

Figure 4: Process "out-of-place". In each round, (1) S[s] is accessed; (2) x is selected at
random from 1, . . . , k; (3) $c_x$ is accessed; (4) the location at $D_x$ is accessed and $D_x$ is
incremented.


The expected value of a function f of a random variable X is denoted as E[f(X)]. When we wish to
make explicit the distribution D from which the random variable is drawn, we will use the notation
$E_{X \sim D}[f(X)]$. All vectors have dimension k (the number of classes) unless stated otherwise, and
we denote the components of a vector $\bar{x}$ by $x_1, x_2, \ldots, x_k$. We now define some probabilities:

(i) For all $i \in \{1, \ldots, k/B\}$, $P_i = \sum_{l=(i-1)B+1}^{iB} p_l$.

(ii) For all $i \in \{1, \ldots, k\}$, we denote by $\bar{a}^i$ the following vector: $a^i_j = 0$ if $i = j$, and $a^i_j = p_j/(1 - p_i)$
otherwise; and by $\bar{b}^i$ the following vector: $b^i_j = 0$ if $(i-1)B + 1 \le j \le iB$, and $b^i_j = p_j/(1 - P_i)$
otherwise. (Note that $\sum_j a^i_j = \sum_j b^i_j = 1$.)

Let $m \ge 0$ be an integer and $\bar{q}$ be a vector of non-negative reals such that $\sum_i q_i = 1$. We denote
by $\varphi(m, \bar{q})$ the probability distribution on the number of balls in each of k bins, when m balls are
independently put into these bins, and a ball goes in bin i with probability $q_i$, for $i \in \{1, \ldots, k\}$.
Thus, $\varphi(m, \bar{q})$ is a distribution on vectors of non-negative integers. If $\bar{\mu}$ is drawn from $\varphi(m, \bar{q})$,
then:

$$\Pr[\mu_1 = m_1, \ldots, \mu_k = m_k] = \Big(m! \prod_{j=1}^{k} q_j^{m_j}\Big) \Big/ \prod_{j=1}^{k} m_j! \qquad (1)$$

whenever $\sum_{j=1}^{k} m_j = m$; other vectors have zero probability.¹ We now define functions f(x)
for $x \ge 0$ and $g(\bar{m})$ for a vector $\bar{m}$ of non-negative integers:

$$f(x) = \begin{cases} 1 & \text{if } x = 0, \\ 1 - \dfrac{x + B - 1}{BC} & \text{if } 0 < x < BC - B + 1, \\ 0 & \text{otherwise,} \end{cases} \qquad (2)$$

$$g(\bar{m}) = \frac{1}{C} \sum_{i=1}^{k/B} \min\Big\{1, \sum_{l=(i-1)B+1}^{iB} m_l\Big\}. \qquad (3)$$

We now set out some propositions that are used in the proofs.
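To make the definitions in Eqs. (1)-(3) concrete, the following sketch evaluates them for small inputs. It reflects our reading of the definitions above; the function names and the choice to pass B and C explicitly are assumptions of this example, and vectors are plain Python lists indexed from 0.

```python
from math import factorial, prod

def phi_pmf(ms, q):
    """Pr[mu = ms] under phi(m, q), Eq. (1), with m = sum(ms). Python evaluates 0.0 ** 0 as 1.0."""
    m = sum(ms)
    numerator = factorial(m) * prod(qj ** mj for qj, mj in zip(q, ms))
    return numerator / prod(factorial(mj) for mj in ms)

def f(x, B, C):
    """Eq. (2)."""
    if x == 0:
        return 1.0
    if x < B * C - B + 1:
        return 1.0 - (x + B - 1) / (B * C)
    return 0.0

def g(ms, B, C):
    """Eq. (3): number of count blocks with at least one access, divided by C."""
    touched = sum(min(1, sum(ms[i:i + B])) for i in range(0, len(ms), B))
    return touched / C
```

For instance, with B = 4 and C = 8 a pointer that advances by one location avoids a given cache block with probability f(1) = 1 - 4/32 = 0.875, and accesses that touch two of three count blocks give g = 2/8.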



Proposition 1 For all real numbers $x_i$, $i = 1, \ldots, k$, such that $|x_i| \le 1$ we have that:

$$\prod_{i=1}^{k} (1 - x_i) \ge 1 - \sum_{i=1}^{k} x_i.$$

¹ We take $0^0 = 1$ in Eq. (1).
Proposition 2 (a) For all real numbers x, such that $|x| < 1$ we have that:

$$\sum_{m=0}^{\infty} x^m = \frac{1}{1 - x}.$$

(b) For all real numbers x, such that $|x| < 1$ we have that:

$$\sum_{m=0}^{\infty} m x^{m-1} = \frac{1}{(1 - x)^2}.$$

(c) For all real numbers x, such that $0 < x < 2$, we have that:

$$\sum_{m=0}^{\infty} x (1 - x)^m m = \frac{1}{x} - 1.$$

Proof. Proposition 2(a) is the standard summation for an infinite decreasing geometric series.
We obtain Proposition 2(b) by differentiating both sides of the equation in Proposition 2(a).
Proposition 2(c) is obtained using Proposition 2(b) and is the expected value of the geometric
distribution multiplied by $1 - x$. □
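The sums in Proposition 2 are easy to check numerically by truncating the series. The helper names and the truncation lengths below are our own choices for this illustration.

```python
def geometric_sums(x, terms=2000):
    """Truncated versions of the sums in Proposition 2(a) and 2(b)."""
    s_a = sum(x ** m for m in range(terms))
    s_b = sum(m * x ** (m - 1) for m in range(1, terms))   # the m = 0 term is zero
    return s_a, s_b

def expected_gap(p, terms=20000):
    """Truncated sum of p (1 - p)^m m, which Proposition 2(c) evaluates to 1/p - 1."""
    return sum(p * (1 - p) ** m * m for m in range(terms))
```

With p = 0.25, for example, the truncated sum agrees with 1/p - 1 = 3 to within rounding error; in the analysis this is the expected number of rounds between two successive accesses to a pointer chosen with probability p.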

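Proposition 3 For all real numbers p and q such that $0 < p + q < 2$, we have that:

$$\sum_{m=0}^{\infty} p (1 - p)^m \Big(1 - \frac{q}{1 - p}\Big)^m = \frac{p}{p + q}.$$

Proof. Since $(1 - p)\big(1 - \frac{q}{1 - p}\big) = 1 - p - q$, using Proposition 2(a) we get that:

$$\sum_{m=0}^{\infty} p (1 - p)^m \Big(1 - \frac{q}{1 - p}\Big)^m = p \sum_{m=0}^{\infty} (1 - p - q)^m = \frac{p}{p + q}.$$

□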



Proposition 4 (a) For all real numbers x, we have that:

$$e^{-x} \ge 1 - x.$$

(b) For all real numbers $x \ge 0$, we have that:

$$e^{-x} \le 1 - x + \frac{x^2}{2}.$$

(c) For all real numbers Xi,i = 1, . . . , k, such that Xi < 1 we have that: 

n(i-:..)<i-E-'+Ef- 



Proof. Propositions 4(a) and 4(b) are from Taylor's series. For Proposition 4(c) we use
Proposition 1. □

Proposition 5 For all real numbers x and y, such that $x \le 1$ and $y \ge 0$, we have that:

$$e^{-xy} \ge (1 - x)^y.$$

Proof. This proposition is proved using Proposition 4(a). □
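A quick numerical spot check of Propositions 4(a), 4(b) and 5 is given below. The function is a hypothetical helper of our own, and checking a handful of points is of course no substitute for the proofs.

```python
import math

def check_exponential_bounds(x, y):
    """Evaluate Prop. 4(a), Prop. 4(b) and Prop. 5 at the point (x, y), with 0 <= x < 1, y >= 0."""
    lower = 1 - x                          # Prop. 4(a): e^{-x} >= 1 - x
    upper = 1 - x + x * x / 2              # Prop. 4(b): e^{-x} <= 1 - x + x^2/2 for x >= 0
    return (math.exp(-x) >= lower,
            math.exp(-x) <= upper,
            math.exp(-x * y) >= (1 - x) ** y)   # Prop. 5
```

Proposition 5 is what allows geometric tails such as $(1 - p)^y$ to be bounded by $e^{-py}$ in the propositions that follow.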
Proposition 6 (a) For all real numbers x and p and integer y, such that $0 < p < 1$, $y \ge 0$ and
$x(1/p + y) = O(1)$, we have that:

$$\sum_{m=0}^{y} p (1 - p)^m m x = x \Big(\frac{1}{p} - 1\Big) - O(e^{-py}).$$

(b) For all real numbers x and p and integer y, such that $0 < p < 1$, $y \ge 0$ and $x = O(1)$, we
have that:

$$\sum_{m=0}^{y} p (1 - p)^m x = x - O(e^{-py}).$$


(c) For all real numbers x, p and q and integer y, such that < p—q < 2, y > and — ^i^)^ 
we have that: 

■u 

^ n('p-(p+'^)a^ 



P + Q 

m=0 ^ ^ 



(d) For all real numbers m, x, p and q and integer y, such that 0<p — q<2, y>0 and 
p+q ^ p+q 

V 



^(^^ + y + 1) = 0(1), we have that: 



Y^p{l-prx = ^:p(L^^ _ 0(e-(^+.).). 



m=0 



Note that we are misusing the O notation here to hide constant factors that are independent of
the variables in the equations.

Proof. Using Proposition 2(c) and Proposition 5, Proposition 6(a) is proved as follows:

$$\sum_{m=0}^{y} p (1 - p)^m m x = \sum_{m=0}^{\infty} p (1 - p)^m m x - (1 - p)^{y+1} \sum_{m=0}^{\infty} p (1 - p)^m (m + y + 1) x = x \Big(\frac{1}{p} - 1\Big) - O(e^{-py}).$$

The proofs of Propositions 6(b), 6(c) and 6(d) are now trivial. □

The vector of random variables $\bar{X} = (X_1, \ldots, X_n)$ is negatively associated if for every two
disjoint index sets $I, J \subseteq [n]$,

$$E[f(X_i, i \in I)\, g(X_j, j \in J)] \le E[f(X_i, i \in I)]\, E[g(X_j, j \in J)]$$

for all functions $f : \mathbb{R}^{|I|} \to \mathbb{R}$ and $g : \mathbb{R}^{|J|} \to \mathbb{R}$ that are both non-decreasing or both non-increasing.

Proposition 7 If the random variables $X_1, \ldots, X_k$ are negatively associated, then for any
non-decreasing functions $f_i$, $i \in [k]$, we have that:

$$E\Big[\prod_{i=1}^{k} f_i(X_i)\Big] \le \prod_{i=1}^{k} E[f_i(X_i)].$$

Proof. The proof follows directly from the definition of negatively associated variables. □

Figure 5: m rounds of Process "in-place". Between two accesses to $D_i$, there are m accesses to
"other" pointers, and m + 1 accesses to C.




4.3 Cache Analysis of In-place Permutation 

In this section we analyse the cache misses in a direct-mapped cache during n rounds of Process
"in-place", introduced in Section 4.1.1. We derive a precise equation for the expected number of
cache misses and then give closed-form upper and lower bounds on this equation. We then derive
upper and lower bounds assuming the keys are drawn independently from a uniform distribution.



4.3.1 Average case analysis 

We start by proving a theorem for the expected number of cache misses during n rounds of Process 
"in-place" . 

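Theorem 1 The expected number X of cache misses in n rounds of Process "in-place" satisfies
$n(p_c + p_d) \le X \le n(p_c + p_d) + k(1 + 1/B)$, where:

$$p_c = \sum_{i=1}^{k/B} P_i \Big(1 - \sum_{m=0}^{\infty} P_i (1 - P_i)^m\, E_{\bar{\nu} \sim \varphi(m, \bar{b}^i)}\Big[\prod_{j=1}^{k} f(\nu_j)\Big]\Big)$$

and

$$p_d = \frac{1}{B} + \frac{B - 1}{B} \sum_{i=1}^{k} p_i \Big(1 - \sum_{m=0}^{\infty} p_i (1 - p_i)^m\, E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}\Big[(1 - g(\bar{\mu})) \prod_{j=1}^{k} f(\mu_j)\Big]\Big).$$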



Proof. We first analyse the miss rates for accesses to pointers $D_1, \ldots, D_k$. Fix an i, $1 \le i \le k$,
and a $z \ge 1$. Let $\mu$ be the random variable which denotes the number of rounds between accesses
to locations $d_{i,z}$ and $d_{i,z+1}$ ($\mu = 0$ if these locations are accessed in consecutive rounds). Figure 5
shows the other memory accesses between accesses z and z + 1 to $D_i$. Clearly, $\Pr[\mu = m] =
p_i (1 - p_i)^m$, for $m = 0, 1, \ldots$. Let $X_i$ denote the event that none of the memory accesses in these
$\mu$ rounds accesses the cache block to which $d_{i,z}$ is mapped. We now fix an integer $m \ge 0$ and
calculate $\Pr[X_i \mid \mu = m]$. Let $\bar{\mu}$ be a vector of random variables such that for $1 \le j \le k$, $\mu_j$ is
the random variable which denotes the number of accesses to $D_j$ in these m rounds. Clearly $\bar{\mu}$ is
drawn from $\varphi(m, \bar{a}^i)$ (note that $D_i$ is not accessed in these m rounds by definition).

Fix any vector $\bar{m}$, such that $\Pr[\bar{\mu} = \bar{m}] \ne 0$, and let $m_j$ be the number of accesses to pointer $D_j$
in these m rounds. Since $m_i$ must be zero, $f(m_i) = 1$, and for $j \ne i$, $f(m_j)$ is the probability that
none of the $m_j$ locations accessed by $D_j$ in these m rounds is mapped to the same cache block as
location $d_{i,z}$ [SI El]. Similarly $g(\bar{m}) \cdot C$ is the number of count blocks accessed in these rounds, and
so $1 - g(\bar{m})$ is the probability that the cache block containing $d_{i,z}$ does not conflict with the blocks
from C which were accessed in these m rounds. As the latter probability is determined by the
starting location of sequence i and the former probabilities by the starting location of sequences
j, $j \ne i$, we conclude that for a given configuration $\bar{m}$ of accesses, the probability that the cache
block containing $d_{i,z}$ is not accessed in these m rounds is $(1 - g(\bar{m})) \prod_{j=1}^{k} f(m_j)$. Averaging over
all configurations $\bar{m}$, we get that

$$\Pr[X_i \mid \mu = m] = E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}\Big[(1 - g(\bar{\mu})) \prod_{j=1}^{k} f(\mu_j)\Big]. \qquad (4)$$

Finally we get,

$$\Pr[X_i] = \sum_{m=0}^{\infty} \Pr[\mu = m] \Pr[X_i \mid \mu = m] = \sum_{m=0}^{\infty} p_i (1 - p_i)^m\, E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}\Big[(1 - g(\bar{\mu})) \prod_{j=1}^{k} f(\mu_j)\Big]. \qquad (5)$$


If $d_{i,z}$ is at a cache block boundary, or if $X_i$ does not occur given that $d_{i,z}$ is not at a cache
block boundary ($\Pr[X_i]$ does not change under this condition), then a cache miss will occur. The
first access to a pointer is a cache miss. So other than for the first access, the probability $p_d$ of a
cache miss for a pointer access is:

$$p_d = \frac{1}{B} + \frac{B - 1}{B} \sum_{i=1}^{k} p_i (1 - \Pr[X_i]). \qquad (6)$$

Including the first access misses, the expected number of cache misses for pointer accesses is at
most

$$\sum_{i=1}^{k} \Big(1 + (n p_i - 1)\Big(\frac{1}{B} + \frac{B - 1}{B} (1 - \Pr[X_i])\Big)\Big) \le n p_d + k. \qquad (7)$$


We now consider the probability of a cache miss for an access to a count array location. It is
convenient to partition C into count blocks of B locations each, where the i-th count block consists
of the locations $c_{(i-1)B+1}, \ldots, c_{iB}$, for $i = 1, \ldots, k/B$. So $P_i$ is the probability of access to the
i-th block. We fix an $i \in \{1, \ldots, k/B\}$ and a $z \ge 1$. Let $\nu$ be the random variable that denotes
the number of rounds between the z-th and (z + 1)-st accesses to the i-th count block. We have
that $\Pr[\nu = m] = P_i (1 - P_i)^m$, for $m = 0, 1, \ldots$. Let $Y_i$ denote the event that none of the memory
accesses in these $\nu$ rounds accesses the cache block to which the i-th count block is mapped.

We now fix an integer $m \ge 0$ and calculate $\Pr[Y_i \mid \nu = m]$. Let $\bar{\nu}$ be a vector of random variables
such that for $1 \le j \le k$, $\nu_j$ is the random variable which denotes the number of accesses to $D_j$
in these m rounds. Given that $k \le BC$, assumption (c) means that two blocks from C cannot
conflict with each other. As the pointers $D_{(i-1)B+1}, \ldots, D_{iB}$ will not be accessed between two
successive accesses to count block i, the probability of accessing pointer $D_j$ is given by $b^i_j$ and
$\varphi(m, \bar{b}^i)$ is the distribution for $\bar{\nu}$. Arguing as above:

$$\Pr[Y_i] = \sum_{m=0}^{\infty} \Pr[\nu = m] \Pr[Y_i \mid \nu = m] = \sum_{m=0}^{\infty} P_i (1 - P_i)^m\, E_{\bar{\nu} \sim \varphi(m, \bar{b}^i)}\Big[\prod_{j=1}^{k} f(\nu_j)\Big]. \qquad (8)$$


The first access to a count array block is a cache miss; for all other accesses there is a cache
miss if event $Y_i$ does not occur. So other than for the first access, the probability $p_c$ of a cache
miss for a count array access is:

$$p_c = \sum_{i=1}^{k/B} P_i (1 - \Pr[Y_i]). \qquad (9)$$

Including the first access misses, the expected number of cache misses for count array accesses
is at most

$$\sum_{i=1}^{k/B} \big(1 + (n P_i - 1)(1 - \Pr[Y_i])\big) \le n p_c + k/B. \qquad (10)$$

Plugging in the values from Eq. (5) into Eq. (7) and from Eq. (8) into Eq. (10) we get the upper
bound on X, the expected number of cache misses in the process. The lower bound in Theorem 1 is
obvious.

□



4.3.2 Upper bound 



We now prove a theorem on the upper bound to the expected number of cache misses during n 
rounds of Process "in-place" . 

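Theorem 2 The expected number of cache misses in n rounds of Process "in-place" is at most
$n(p_d + p_c) + k(1 + 1/B)$, where:

$$p_d \le \frac{1}{B} + \frac{k}{BC} + \frac{B - 1}{BC} \sum_{i=1}^{k} \Big(\sum_{j=1}^{k/B} \frac{p_i P_j}{p_i + P_j} + \frac{B - 1}{B} \sum_{j=1}^{k} \frac{p_i p_j}{p_i + p_j}\Big),$$

$$p_c \le \frac{k}{B^2 C} + \frac{B - 1}{BC} \sum_{i=1}^{k/B} \sum_{j=1}^{k} \frac{P_i p_j}{P_i + p_j}.$$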



Proof. In the proof we derive lower bounds for $\Pr[X_i]$ and $\Pr[Y_i]$ and use these to derive the
upper bounds on $p_d$ and $p_c$.

Again, we consider a fixed i and consider the event $X_i$ defined in the proof of Theorem 1. We
now obtain a lower bound on $\Pr[X_i]$.



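Lower bound on $\Pr[X_i]$

Letting $\Gamma(x) = 1 - f(x)$ and using Proposition 1 we can rewrite Eq. (5) as

$$\Pr[X_i] \ge \sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}\Big[1 - g(\bar{\mu}) - \sum_{j=1}^{k} \Gamma(\mu_j)\Big]. \qquad (11)$$

We know that the j-th count block contributes 1/C to $g(\bar{\mu})$ if there is an access to that block
and $\Pr[\text{j-th count block accessed} \mid \mu = m] = 1 - (1 - c_j)^m$, where $c_j = P_j/(1 - p_i)$. So we have that,

$$E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[g(\bar{\mu})] = \frac{1}{C} \sum_{j=1}^{k/B} \big(1 - (1 - c_j)^m\big),$$

and we get,

$$\sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[g(\bar{\mu})] = \sum_{m=0}^{\infty} p_i (1 - p_i)^m \sum_{j=1}^{k/B} \frac{1}{C} \big(1 - (1 - c_j)^m\big) = \frac{1}{C} \sum_{j=1}^{k/B} \sum_{m=0}^{\infty} p_i (1 - p_i)^m \big(1 - (1 - c_j)^m\big),$$

and using Proposition 3 we get,

$$\sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[g(\bar{\mu})] = \frac{1}{C} \sum_{j=1}^{k/B} \frac{P_j}{p_i + P_j}. \qquad (12)$$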
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We now evaluate 



Pr[/i = m] ^E^^^(„,a.)[r(Mi 



m=0 



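Our approach is to first fix $j$ and evaluate $E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(\mu_j)]$. For $m < BC$, we know that

\[
E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(\mu_j)] = \sum_{l=0}^{m} \Pr[\mu_j = l]\,\frac{l + B - 1}{BC} - \frac{B-1}{BC}\,\Pr[\mu_j = 0].
\]

The last term is due to the fact that $\Gamma(x)$ is discontinuous and $\Gamma(0) = 0$. Similarly for $m \ge BC$
we know that

\[
E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(\mu_j)] = \sum_{l=0}^{m} \Pr[\mu_j = l]\,\frac{l + B - 1}{BC} - \frac{B-1}{BC}\,\Pr[\mu_j = 0]
- \sum_{l=BC-B+1}^{m} \Pr[\mu_j = l]\left(\frac{l + B - 1}{BC} - 1\right).
\]

The last term is due to the fact that $\Gamma(x) = 1$ for $x \ge BC - B + 1$. If we drop this last term when
$m \ge BC$, we get that for all $m$

\[
E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(\mu_j)] \le \frac{1}{BC}\sum_{l=0}^{m} l\,\Pr[\mu_j = l] + \frac{B-1}{BC}\left(1 - \Pr[\mu_j = 0]\right).
\]

The summation term is the expected value of the random variable with the binomial distribution
$b(l; m, a_j)$, and $\Pr[\mu_j = 0] = (1 - a_j)^m$. So we get that

\[
E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(\mu_j)] \le \frac{1}{BC}\left(m a_j + (B - 1)\left(1 - (1 - a_j)^m\right)\right). \tag{13}
\]

We now evaluate $\sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(\mu_j)]$:

\[
\sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(\mu_j)]
\le \sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} \frac{1}{BC}\left(m a_j + (B - 1)\left(1 - (1 - a_j)^m\right)\right).
\]

Since $\sum_{j=1}^{k} a_j = 1$, we get $\sum_{m=0}^{\infty} \Pr[\mu = m]\, m \sum_{j=1}^{k} a_j = 1/p_i - 1$ by an application of Proposition 2(c).
By applying Proposition 3 we get that
$\sum_{m=0}^{\infty} \Pr[\mu = m](B - 1)\left(1 - (1 - a_j)^m\right) = (B-1)\,\frac{p_j}{p_i + p_j}$. So we get:

\[
\sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(\mu_j)]
\le \frac{1}{BC}\left(\frac{1}{p_i} - 1 + (B-1)\sum_{j=1}^{k} \frac{p_j}{p_i + p_j}\right). \tag{14}
\]

Substituting Eq. 12 and Eq. 14 in Eq. 11 we obtain the following lower bound for $\Pr[X_i]$:

\[
\Pr[X_i] \ge 1 - \frac{1}{C}\sum_{j=1}^{k/B} \frac{P_j}{p_i + P_j} - \frac{1}{BC}\left(\frac{1}{p_i} - 1 + (B-1)\sum_{j=1}^{k} \frac{p_j}{p_i + p_j}\right). \tag{15}
\]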
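The closed forms in Eqs. 12 and 14 both rest on summing a geometric series against $\Pr[\mu = m] = p_i(1-p_i)^m$. As a sanity check, the following Python sketch evaluates that series numerically, compares it with the closed form $q/(p_i + q)$, and then evaluates the right-hand side of the lower bound on $\Pr[X_i]$ for a given distribution. The function names, the truncation length and the example parameters are illustrative assumptions, not part of the analysis.

```python
from math import isclose

def geometric_mix(p_i, q, terms=20000):
    """Truncated value of sum_{m>=0} p_i (1-p_i)^m (1 - (1 - q/(1-p_i))^m).

    q stands for either p_j (pointer term) or P_j (count-block term);
    the closed form of the infinite sum is q / (p_i + q).
    """
    r = q / (1.0 - p_i)      # conditional access probability (a_j or c_j)
    total = 0.0
    weight = p_i             # Pr[mu = m] = p_i (1 - p_i)^m, starting at m = 0
    untouched = 1.0          # (1 - r)^m
    for _ in range(terms):
        total += weight * (1.0 - untouched)
        weight *= 1.0 - p_i
        untouched *= 1.0 - r
    return total

def pr_x_lower_bound(i, p, B, C):
    """Right-hand side of the lower bound on Pr[X_i] (0-based index i).

    p is the list of pointer access probabilities; len(p) is assumed to be
    a multiple of B so that count blocks are groups of B consecutive classes.
    """
    k = len(p)
    P = [sum(p[b:b + B]) for b in range(0, k, B)]          # block probabilities P_j
    count_term = sum(Pj / (p[i] + Pj) for Pj in P) / C
    pointer_term = (1.0 / p[i] - 1.0
                    + (B - 1) * sum(pj / (p[i] + pj) for pj in p)) / (B * C)
    return 1.0 - count_term - pointer_term

# Example: the series agrees with the closed form q / (p_i + q).
print(isclose(geometric_mix(0.05, 0.2), 0.2 / (0.05 + 0.2), rel_tol=1e-9))
```

For a uniform distribution over k = 8 classes with B = 4 and C = 64, `pr_x_lower_bound(0, [0.125] * 8, 4, 64)` evaluates the bound directly from the two sums.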
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Upper bound on pd 

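Finally, substituting $\Pr[X_i]$ from Eq. 15 in the expression for $p_d$ from Theorem 1 we get:

\[
p_d \le \frac{1}{B} + \frac{B-1}{B}\sum_{i=1}^{k} p_i\left(\frac{1}{C}\sum_{j=1}^{k/B}\frac{P_j}{p_i+P_j}
+ \frac{1}{BC}\Big(\frac{1}{p_i} - 1 + (B-1)\sum_{j=1}^{k}\frac{p_j}{p_i+p_j}\Big)\right)
\]
\[
\le \frac{1}{B} + \frac{(B-1)k}{B^2C} + \frac{B-1}{BC}\sum_{i=1}^{k}\sum_{j=1}^{k/B}\frac{p_iP_j}{p_i+P_j}
+ \frac{(B-1)^2}{B^2C}\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{p_ip_j}{p_i+p_j}.
\]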



< 



B - 1 



EE 



p,P, B-l 



B BC BC ^\^v, + P, B ^v,+p, 



E 



piPj 



We can evaluate $p_c$ using a very similar approach, as sketched out now. We again consider a
fixed $i$ and consider the event $Y_i$ defined in the proof of Theorem 1. We now obtain a lower bound
on $\Pr[Y_i]$.

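Lower bound on Pr[Y_i]

Again letting $\Gamma(x) = 1 - f(x)$ and using Proposition 1 we can rewrite the expression for $\Pr[Y_i]$ from the proof of Theorem 1 as:

\[
\Pr[Y_i] \ge \sum_{m=0}^{\infty} \Pr[\nu = m]\, E_{\bar{\nu} \sim \varphi(m,\bar{b})}\Big[ 1 - \sum_{j=1}^{k} \Gamma(\nu_j) \Big].
\]

Arguing as for the derivation of Eq. 13 we get

\[
E_{\bar{\nu} \sim \varphi(m,\bar{b})}[\Gamma(\nu_j)] \le \frac{1}{BC}\left(m b_j + (B - 1)\left(1 - (1 - b_j)^m\right)\right).
\]

Then arguing as for the derivation of Eq. 14 we get

\[
\sum_{m=0}^{\infty} \Pr[\nu = m] \sum_{j=1}^{k} E_{\bar{\nu} \sim \varphi(m,\bar{b})}[\Gamma(\nu_j)]
\le \frac{1}{BC}\left(\frac{1}{P_i} - 1 + (B-1)\sum_{j=1}^{k} \frac{p_j}{P_i + p_j}\right). \tag{16}
\]

Substituting this into the inequality for $\Pr[Y_i]$ above we get:

\[
\Pr[Y_i] \ge 1 - \frac{1}{BC}\left(\frac{1}{P_i} - 1 + (B-1)\sum_{j=1}^{k} \frac{p_j}{P_i + p_j}\right). \tag{17}
\]

Upper bound on p_c

Substituting $\Pr[Y_i]$ from Eq. 17 in Eq. 9 we get

\[
p_c \le \sum_{i=1}^{k/B} P_i\,\frac{1}{BC}\left(\frac{1}{P_i} - 1 + (B-1)\sum_{j=1}^{k} \frac{p_j}{P_i + p_j}\right)
\le \frac{k}{B^2C} + \frac{B-1}{BC}\sum_{i=1}^{k/B}\sum_{j=1}^{k}\frac{P_ip_j}{P_i+p_j}.
\]

□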
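To make the two upper bounds concrete, the following Python sketch evaluates them for an arbitrary distribution, taking the bounds in the expanded form $p_d \le 1/B + (B-1)k/(B^2C) + \frac{B-1}{BC}\sum_i\sum_j p_iP_j/(p_i+P_j) + \frac{(B-1)^2}{B^2C}\sum_i\sum_j p_ip_j/(p_i+p_j)$ and $p_c \le k/(B^2C) + \frac{B-1}{BC}\sum_i\sum_j P_ip_j/(P_i+p_j)$. This is our reading of the derivation above; the function name and the example parameters are hypothetical, and the code assumes that all $p_i > 0$ and that $B$ divides $k$.

```python
def in_place_upper_bounds(p, B, C):
    """Evaluate the upper bounds on p_d and p_c for Process "in-place".

    p : list of class probabilities p_1..p_k (all positive, summing to 1)
    B : number of keys per cache block
    C : number of blocks in the direct-mapped cache
    """
    k = len(p)
    # P_j: probability of accessing the j-th count block (B consecutive classes).
    P = [sum(p[b:b + B]) for b in range(0, k, B)]
    # Double sums appearing in both bounds.
    cross = sum(pi * Pj / (pi + Pj) for pi in p for Pj in P)
    pair = sum(pi * pj / (pi + pj) for pi in p for pj in p)
    p_d = (1.0 / B
           + (B - 1) * k / (B * B * C)
           + (B - 1) * cross / (B * C)
           + (B - 1) ** 2 * pair / (B * B * C))
    p_c = k / (B * B * C) + (B - 1) * cross / (B * C)
    return p_d, p_c

# Example: uniform keys, k = 64 classes, B = 8 keys per block, C = 1024 blocks.
p_d, p_c = in_place_upper_bounds([1.0 / 64] * 64, 8, 1024)
print(p_d, p_c)
```

For the uniform case the double sums collapse to $k/(B+1)$ and $k/2$ respectively, which gives a quick way to check the implementation against a closed form.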

This proves the upper bound for the equation in Theorem 1. We now prove a lower bound on that
equation.
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4.3.3 Lower bound 

Theorem 3 When pi > 1/C then the expected number of cache misses in n rounds of Process 
"in-place" is at least npd + k, where: 

1 fc(2C-fc) fc(fc-3C) 1 k 

^'^-'B ^ 2^2 + 2BC2 2BC ^ 252(7 

^ B(fc-C) + 2C-3fc ^^ (k)2 



+ 



{B-l) 



BC^ 

2 ^ 

-El 

i=l 



E 



2 — 1 J — 1 

-p,; -Pj) B-l 



EE 



Proof. We again consider a fixed $i$ and consider the event $X_i$ defined in the proof of Theorem 1.
Let $\bar{\mu}$ be as defined in the proof of Theorem 1. We now obtain an upper bound on $\Pr[X_i]$.

Upper bound on Pr[Xi] 

In [3] it is shown that the variables /ij are negatively associated [5^. Noting that f{x) is a non- 
increasing function of x, then using Proposition [7] we have that: 

k k 

So we can re- write Eq. [5] as: 

BC-B k 00 

m=0 j=l m=BC-B + l 



(18) 



We first bound the last term. We know that 



J2 PT[^i = m] - (l-K)^^-^+i^Pr[A* = 



m=BC-B+l 



BC-B+1 



Using Proposition 5 we get that $(1 - p_i)^{BC-B+1} \le e^{-(BC-B+1)p_i}$. Assuming $p_i \ge 1/C$ the last
term is at most $O(e^{-B})$.

We now bound the first term in Eq. [THl We use an approach similar to the derivation of Eq. [T^ 
and since /i < BC — B, so fij < BC — B, we don't have to drop any terms in the simplification, 
so we get that: 

E^.^(,„,a-,)[/(A*,)] = 1 - + iB- m - (1 - «;■)"))• 

Letting tj{m) ~ -^{ma^j + {B — 1)(1 — (1 — a*)™)) and using Proposition dja) we get that 
g-E, *^('") > _ tj{m)). So we have that 

BC-B 



m— 

BC — B , ^, > 
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Using Proposition IDJb) and letting Pj = (1 — a)) we get that: 



BC-B 



m=0 
1 



{B-l)k {iB-l)k)^ 



BC 



2(BC)2 2(SC)2 

k 



2{BCY 



k k 



2m^/3f + 2(i?-l)fc^/3™ 

j=i i=i 

+ 0(e-^). 



(19) 



We now evaluate the terms in Eg . [T9l assuming that pi > 1/C, so fc < C. For the simplifications 
of the subtractive terms we use the fact that e^Pi(^C'^^+i) <^ . 

Since Pi > 1/C, {1/pi + BC — B)/{BC) = 0(1), so using Proposition [SJa), we get that 



m=0 



Since > 1/C, (i? — l)fc/(i?C) < 1, so using Proposition [HJb), we get that 



m— 

We now evaluate the term 

a-l 



BC 



BC 



(20) 



(21) 



^ ' m=0 j = l 



(B-l) 



[Bcy ^ ^ 

^ ' m=0 j = l 

{B -I) ^ p,{l-pi -pj) 



aiB-l) { ^ P^{l ~ P^ - Pj) 



{BCf 



E 



1 (pi+pj)^ 



E 



t P^+P] 



[B -I) p,{l - Pi ~pj) 
{BCf ^ {P.+P,? 



-0(e-^). 



(22) 



The last simplification is due to pi > 1/C and fc < C, so X]i<j<A;(Pi(l ~ Pi ~Pj))/{Pi ^ 



fcC < C2 and Y.l<,<ki^P^)l{P^+PJ) < kCB < BC\ 

Substituting back (1 — a*) — j3j and using Proposition [3] we get that 



(U 1 ^2L BC-B k 

E pr[A^ = -]E/?r 

V ^ n j-^i 



m=0 
(B-l)2fc 



{BCf j^^P.+P, 



.(S-l)2fc 



(i?C)2 ^p.+p, 
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The last step used Ei=iK/(P» + Pj) < ^ and {{B - l)kfl{BCf) < 1. 
We now evaluate the additive terms, starting with 

BC-B 



y Pr[u = „,i!^^M^<(^:iI)^fl_i 



m—O 

Since m < BC we now get that 

BC-B 

'2(BC)2 - 2BC \p. 



E m^ 1/1 

m— 



Substituting back (1 — aj) = /3j and using Proposition [3] we get that: 

P -I BC-s k n 1 



BC " ^^'^ - 

m—O j—O j — 1 

Finally we evaluate X!m^o P^^Im — ^] X!f:^i Pj^Pi'^^ first evaluating 

(l-p,)/3jA = (l-pO(l-a;)(l-an 

_ 1 - - - Pi + K(pt + + Pi) + PjPi 

Using this result and Proposition [21[a), we get that 

a-l 

J2 Pr[M = m]P,"^(3r 



m—O 

oo 
m=0 



a-p^) J 

Pi 



< 



Pt + Pj +Pl - PjPl/il - Pi) 
Pi 



Pi + Pj +Pl - PjPl 



So we get that 



^^Epr[M--]EE/^r/3- 

^ ' m=Q 1 = 1 1=1 



2(-BC')^ ~{fr(Pi+Pj +Pi - PjPi ' 
Lower bound on pd 

Plugging Eqs. [20], ..,[271 into Eq.[Tn|we get that: 

Pr\X] < 1 I fl A iB-l)k iB-l) ' p,{l-p,-p,) 
^ - Bc[p, V BC (BC? f^^ {P.+Pj? 

i^B - Ifk ' p. {B-l)k fl \ 1 fl 
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B - 1 



^ T). -I- 



^2 



j^iP^^ Pi 2(SC)2 ^ ^ Pi + Pi + P; - PjPi 
((S-l)fc)' 



2(BC)2 

Plugging Eq. [^H] into Eq. [H] we get 



+ 0(e-). 



1 B - 1 A 
Pd > — H — Z^Pi 



B B 



1 

1 

Pi 



2BC 



2{B - l)k 
BC 



p, B-1 f {B-l)k 



^v,+pj BC V BC 
iB-l)k (^_{B-l)k 



BC 



2BC 



(B -1) y-^ P»(l ~Pt "Pj) 

{Bcy ^ (p,+P,)2 



EE 



(28) 



Simplifying further and using J2i=i Hi=\Pi' liP^ +Pj) < ^ ai^d 2i=iPi(l/Pi - 1) = - 1, we 



get: 



k(2C-k) k{k-3C) 



1 



k 



2C2 



2BC^ 2BC 2B^C 

k k 



B{k-C) + 2C-3k^^ {p,f 



{B-l) 
B^C^ 



BC^ 

2 k 



E 



■EE 

P»(l -Pz -Pj) _ B-l 
- (p.+P,)2 2 



k k 

EE 

^ ^ p» + Pj + Pi - PiP/ 



□ 



4.3.4 Upper and lower bounds for uniformly random data 



Using the upper and lower bound theorems just proven for general probability distributions, we
now derive corollaries giving upper and lower bounds for the uniform distribution.



Corollary 1 If $p_1 = \ldots = p_k = 1/k$ then the number of cache misses in n rounds of Process
"in-place" is at most
k 



1 k{B + 5) 
B ^ 2BC 



B^C 



fc 1 



Proof. Since $P_i$ in $p_c$ and $P_j$ in $p_d$ are both $B/k$ in the equations in Theorem 2, we get that:



Pd+Pc < 7^ 



< 



2{B - 1) fc2 B/k 



BC B B 
2{B~1) k 



{B-l)\,l/k 



BC 
3 
B 



k 

k{B + 5) 
2BC 



B 
B 



1 B^C 
{B~lf k 
B^C 2 
k 



B^C 



B^C 
k 

'bc 



k 

BC 



2B 



B^C 



B^C 
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□ 



Remark: As we will see later, Process "in-place" models the permute phase of distribution sorting
and Corollary 1 shows that one pass of uniform distribution sorting incurs O(n/B) cache misses if
and only if k = O(C/B).

The following corollary is from the lower bound result in Theorem 3.



Corollary 2 If pi = 

"in-place" is at least: 



Pk 



k 

2C 



= l/k then the number of cache misses in n rounds of Process 



BC^ 2BC 

Proof. Plugging = 1/fc in the equation in Theorem [3] we get that: 



Pd > 



1 

B 



k{2C-k) k{k-3C) 1 k 

2C72 ^ 2BC^ 2BC ^ 2B^C 
B{k -C) + 2C-3kk {B- 1)2 I" {k - 2)k 
2 ^ B3C2 



1 

> — 

- B 



k 

2C 



BC2 

fc2 k 



B - 1 



fc3 



3fc - 1 



1 



fc , {B-lf 

BC^ 2BC 2B^C 12B3C2 



{P{b-2B) - Ik + 2) 



□ 

Remark: From Corollary 1 we have that for uniformly random data and $k = \alpha C$, where $\alpha \le 1$,
other than for small values of B, the upper bound for the number of cache misses in n rounds is
roughly

\[ \frac{\alpha n}{2}, \]

and from Corollary 2 we have that for uniformly random data and $k = \alpha C$, where $\alpha \le 1$, other
than for small values of B, the lower bound for the number of cache misses in n rounds is roughly

\[ \frac{\alpha (3 - \alpha)\, n}{6}. \]

The ratio between the upper and lower bound is $3/(3 - \alpha)$. So we have that for uniformly
random data the lower bound is within a factor of about 3/2 of the upper bound when $k \le C$ and
is much closer when $k \ll C$.



4.4 Cache Analysis of Out-of-place Permutation 

In this section we analyse the cache misses in a direct mapped cache during n rounds of Process
"out-of-place", introduced in Section 4.1.2. We derive a precise equation for the expected
number of cache misses and closed-form upper and lower bounds. During the analysis we re-use
$p_i$, $P_i$, $\bar{a}$, $\bar{b}$, $f(x)$ and $g(\bar{m})$ introduced in Section 4.3.



4.4.1 Average case analysis 

We start by proving a theorem for the expected number of cache misses during n rounds of Process 
"out-of-place" . 

Theorem 4 The expected number X of cache misses in n rounds of Process "out-of-place" is
$n(p_c + p_d + p_s) \le X \le n(p_c + p_d + p_s) + k(1 + 1/B) + 1$, where:

\[
p_c = \sum_{i=1}^{k/B} P_i \left( 1 - \sum_{m=0}^{\infty} P_i (1 - P_i)^m\, E_{\bar{\nu} \sim \varphi(m,\bar{b})}\Big[ f(m+1) \prod_{j=1}^{k} f(\nu_j) \Big] \right),
\]
\[
p_d = \frac{1}{B} + \frac{B-1}{B} \sum_{i=1}^{k} p_i \left( 1 - \sum_{m=0}^{\infty} p_i (1 - p_i)^m\, E_{\bar{\mu} \sim \varphi(m,\bar{a})}\Big[ (1 - g(\bar{\mu}))\, f(m+1) \prod_{j=1}^{k} f(\mu_j) \Big] \right),
\]
\[
p_s = \frac{1}{B} + \frac{B-1}{B}\left(1 - \Big(1 - \frac{1}{C}\Big)^{2}\right).
\]

Figure 6: m rounds of Process "out-of-place". Between two accesses to $D_i$, there are m accesses
to "other" pointers, m + 1 accesses to C, and m + 1 accesses to consecutive locations in S.




Proof. We first analyse the miss rates for accesses to pointers $D_1, \ldots, D_k$. Fix an $i$, $1 \le i \le k$,
and a $z \ge 1$ and consider the probability of a miss between accesses $z$ and $z + 1$ to pointer $D_i$.
We define $\mu$, $\mu_j$, $\bar{m}$, $\bar{\mu}$, $m$, $\bar{a}$, $\varphi(m, \bar{a})$ and $X_i$ as in the proof of Theorem 1. Again $f(m_j)$ is the
probability that none of the $m_j$ locations accessed by $D_j$ in $m$ rounds is mapped to the same cache
block as location $d_{i,z}$. Similarly $g(\bar{m}) \cdot C$ is the number of count blocks accessed in $m$ rounds,
and so $1 - g(\bar{m})$ is the probability that the cache block containing $d_{i,z}$ does not conflict with the
blocks from C which were accessed in these $m$ rounds. We also have accesses to $m + 1$ contiguous
locations in S and $f(m + 1)$ is the probability that these $m + 1$ accesses are not to the cache block
containing $d_{i,z}$. Figure 6 shows the other memory accesses between accesses $z$ and $z + 1$ to $D_i$.

For a given configuration $\bar{m}$ of accesses, as the probabilities $f(m_j)$, $g(\bar{m})$ and $f(m + 1)$ are
independent, we conclude that the probability that the cache block containing $d_{i,z}$ is not accessed
in these $m$ rounds is $(1 - g(\bar{m}))\, f(m+1) \prod_{j=1}^{k} f(m_j)$. Averaging over all configurations $\bar{m}$, we get that

\[
\Pr[X_i \mid \mu = m] = E_{\bar{\mu} \sim \varphi(m,\bar{a})}\Big[ (1 - g(\bar{\mu}))\, f(m+1) \prod_{j=1}^{k} f(\mu_j) \Big]. \tag{29}
\]

Using which we get,

\[
\Pr[X_i] = \sum_{m=0}^{\infty} \Pr[\mu = m]\, \Pr[X_i \mid \mu = m]
= \sum_{m=0}^{\infty} p_i (1 - p_i)^m\, E_{\bar{\mu} \sim \varphi(m,\bar{a})}\Big[ (1 - g(\bar{\mu}))\, f(m+1) \prod_{j=1}^{k} f(\mu_j) \Big]. \tag{30}
\]

Arguing as in the proof of Theorem 1 we get that, other than for the first access, the probability $p_d$ of a cache
miss for a pointer access is:

\[
p_d = \frac{1}{B} + \frac{B-1}{B} \sum_{i=1}^{k} p_i \left(1 - \Pr[X_i]\right). \tag{31}
\]

Including the first access misses, the expected number of cache misses for pointer accesses is at
most

\[
\sum_{i=1}^{k} \left( 1 + (n p_i - 1)\Big( \frac{B-1}{B}\left(1 - \Pr[X_i]\right) + \frac{1}{B} \Big) \right) \le n p_d + k. \tag{32}
\]

We now consider the probability of a cache miss for an access to a count array location. Fix
an $i \in \{1, \ldots, k/B\}$ and a $z \ge 1$ and consider the probability of a miss between accesses $z$ and $z + 1$
to count block $i$. We define $\nu$, $\nu_j$, $m$, $\bar{\nu}$, $\bar{m}$, $b_j$, $\varphi(m, \bar{b})$, and $Y_i$ as in the proof of Theorem 1.

Again, $k \le BC$ and assumption (c) mean that two blocks from C cannot conflict
with each other. So we need to determine the probability of a conflict given $m_j$ accesses to the
pointer $D_j$, for all $j \in \{1, \ldots, k\}$, and $m + 1$ accesses to contiguous locations in S. Again $f(m_j)$ is
the probability that none of the $m_j$ locations accessed by $D_j$ in $m$ rounds is mapped to the same
cache block as count block $i$, and $f(m + 1)$ is the probability that the accesses to $m + 1$ contiguous locations
in S are not to the same cache block as count block $i$.

So we have:

\[
\Pr[Y_i] = \sum_{m=0}^{\infty} \Pr[\nu = m]\, \Pr[Y_i \mid \nu = m]
= \sum_{m=0}^{\infty} P_i (1 - P_i)^m\, E_{\bar{\nu} \sim \varphi(m,\bar{b})}\Big[ f(m+1) \prod_{j=1}^{k} f(\nu_j) \Big]. \tag{33}
\]

Arguing as for Eq. 9, the probability $p_c$ of a cache miss for a count array access is:

\[
p_c = \sum_{i=1}^{k/B} P_i \left(1 - \Pr[Y_i]\right). \tag{34}
\]

Including the first access misses, the expected number of cache misses for count array accesses
is at most

\[
\sum_{i=1}^{k/B} \Big( 1 + (n P_i - 1)\left(1 - \Pr[Y_i]\right) \Big) \le n p_c + k/B. \tag{35}
\]

We now calculate cache misses for accesses to the array S. We consider the probability of a
cache miss between accesses to S[s] and S[s + 1]. We know that there is exactly one access to
a count block and one access to a pointer between two accesses to S. The probability that the
pointer access is to the same cache block as S[s] is 1/C. The probability that a block from C maps
to the same cache block as S[s] is k/(BC). Given that a block from C maps to the same cache block
as S[s], the probability that the access to the count array is to the same cache block as S[s] is
B/k. So the probability that the count array access is to the same cache block as S[s] is also 1/C. So
the probability that there are no memory accesses to the cache block that S[s] is mapped to before
the access to S[s + 1] is

\[ (1 - 1/C)^2. \]

We have a cache miss if S[s] is at a cache block boundary, otherwise the probability of a cache
miss is $1 - (1 - 1/C)^2$. So the probability $p_s$ of a cache miss for an access to S is

\[
p_s = \frac{1}{B} + \frac{B-1}{B}\left(1 - (1 - 1/C)^2\right).
\]

The first access to S is always a cache miss, so the expected number of cache misses in accesses
to S is:

\[ n p_s + 1. \]

Plugging in the values from Eq. 30 into Eq. 32 and from Eq. 33 into Eq. 35 we get the upper
bound on X, the expected number of cache misses in the processes.
The lower bound in Theorem 4 is obvious.

□
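The miss probability for the sequentially scanned array S depends only on B and C, so it is easy to tabulate. The following Python sketch evaluates $p_s = 1/B + \frac{B-1}{B}(1 - (1 - 1/C)^2)$ as read from the argument above; the function name and the sample parameters are illustrative assumptions.

```python
def p_s(B, C):
    """Miss probability for an access to S in Process "out-of-place".

    With probability 1/B the access starts a new block and always misses;
    otherwise it misses only if the intervening pointer access or count
    access evicted the block, each of which happens with probability 1/C.
    """
    survive = (1.0 - 1.0 / C) ** 2       # neither intervening access conflicts
    return 1.0 / B + (B - 1) / B * (1.0 - survive)

# Example: blocks of 8 keys and a cache of 1024 blocks.
print(p_s(8, 1024))
```

For realistic cache sizes the second term is tiny, so $p_s$ is close to the compulsory rate $1/B$.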



4.4.2 Upper bound 

We now prove a theorem on the upper bound to the expected number of cache misses during n 
rounds of Process "out-of-place" . 

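Theorem 5 The expected number of cache misses in n rounds of Process "out-of-place" is at most
$n(p_d + p_c + p_s) + k(1 + 1/B) + 1$, where:

\[
p_d \le \frac{1}{B} + \frac{2(B-1)k}{B^2C} + \frac{B-1}{BC}\sum_{i=1}^{k}\sum_{j=1}^{k/B}\frac{p_iP_j}{p_i+P_j}
+ \frac{(B-1)^2}{B^2C}\Big(1 + \sum_{i=1}^{k}\sum_{j=1}^{k}\frac{p_ip_j}{p_i+p_j}\Big),
\]
\[
p_c \le \frac{2k}{B^2C} + \frac{B-1}{BC}\Big(1 + \sum_{i=1}^{k/B}\sum_{j=1}^{k}\frac{P_ip_j}{P_i+p_j}\Big),
\]
\[
p_s = \frac{1}{B} + \frac{B-1}{B}\left(1 - \Big(1 - \frac{1}{C}\Big)^{2}\right).
\]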



Proof. As for the upper bound for in-place permutation, in this proof we derive lower bounds
for $\Pr[X_i]$ and $\Pr[Y_i]$ and we will use these to derive the upper bounds on $p_d$ and $p_c$. We make
extensive use of the results obtained during the proof of Theorem 2.

Again, we consider a fixed $i$ and consider the event $X_i$ defined in the proof of Theorem 4. We
now obtain a lower bound on $\Pr[X_i]$.

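Lower bound on Pr[X_i]

Letting $\Gamma(x) = 1 - f(x)$ and using Proposition 1 we can rewrite Eq. 30 as:

\[
\Pr[X_i] \ge \sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar{\mu} \sim \varphi(m,\bar{a})}\Big[ 1 - g(\bar{\mu}) - \Gamma(m+1) - \sum_{j=1}^{k} \Gamma(\mu_j) \Big]. \tag{36}
\]

We can use Eq. 12 as a simplification for

\[
\sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar{\mu} \sim \varphi(m,\bar{a})}[g(\bar{\mu})],
\]

and Eq. 14 as an upper bound on

\[
\sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(\mu_j)].
\]

So we just have to evaluate

\[
\sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(m+1)].
\]

Since we always have at least one access to S, we have that

\[
\sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar{\mu} \sim \varphi(m,\bar{a})}[\Gamma(m+1)]
= \sum_{m=0}^{\infty} \Pr[\mu = m]\,\frac{m + B}{BC} - \sum_{m=BC-B}^{\infty} \Pr[\mu = m]\left(\frac{m + B}{BC} - 1\right)
\]
\[
\le \frac{1}{BC}\sum_{m=0}^{\infty} \Pr[\mu = m]\,(m + B)
= \frac{1}{BC}\left(\frac{1}{p_i} - 1 + B\right), \tag{37}
\]

where the last simplification used Proposition 2(c). Substituting Eq. 12, Eq. 14 and Eq. 37 in
Eq. 36 we obtain the following lower bound for $\Pr[X_i]$:

\[
\Pr[X_i] \ge 1 - \frac{1}{C}\sum_{j=1}^{k/B}\frac{P_j}{p_i+P_j}
- \frac{1}{BC}\left(\frac{2}{p_i} + (B-1)\Big(1 + \sum_{j=1}^{k}\frac{p_j}{p_i+p_j}\Big)\right). \tag{38}
\]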



Upper bound on pd 

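Finally, substituting $\Pr[X_i]$ from Eq. 38 in Eq. 31 we get

\[
p_d \le \frac{1}{B} + \frac{B-1}{B}\sum_{i=1}^{k} p_i\left(\frac{1}{C}\sum_{j=1}^{k/B}\frac{P_j}{p_i+P_j}
+ \frac{1}{BC}\Big(\frac{2}{p_i} + (B-1)\Big(1 + \sum_{j=1}^{k}\frac{p_j}{p_i+p_j}\Big)\Big)\right)
\]
\[
= \frac{1}{B} + \frac{2(B-1)k}{B^2C} + \frac{B-1}{BC}\sum_{i=1}^{k}\sum_{j=1}^{k/B}\frac{p_iP_j}{p_i+P_j}
+ \frac{(B-1)^2}{B^2C}\Big(1 + \sum_{i=1}^{k}\sum_{j=1}^{k}\frac{p_ip_j}{p_i+p_j}\Big).
\]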



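We can evaluate $p_c$ using a very similar approach to that used in the proof of Theorem 2. We
again consider a fixed $i$ and consider the event $Y_i$ defined in the proof of Theorem 4. We now
obtain a lower bound on $\Pr[Y_i]$.

Lower bound on Pr[Y_i]

We can rewrite Eq. 33 as:

\[
\Pr[Y_i] \ge \sum_{m=0}^{\infty} \Pr[\nu = m]\, E_{\bar{\nu} \sim \varphi(m,\bar{b})}\Big[ 1 - \Gamma(m+1) - \sum_{j=1}^{k} \Gamma(\nu_j) \Big]. \tag{39}
\]

Eq. 16 gives us

\[
\sum_{m=0}^{\infty} \Pr[\nu = m] \sum_{j=1}^{k} E_{\bar{\nu} \sim \varphi(m,\bar{b})}[\Gamma(\nu_j)]
\le \frac{1}{BC}\left(\frac{1}{P_i} - 1 + (B-1)\sum_{j=1}^{k} \frac{p_j}{P_i + p_j}\right).
\]

Arguing as for Eq. 37,

\[
\sum_{m=0}^{\infty} \Pr[\nu = m]\, E_{\bar{\nu} \sim \varphi(m,\bar{b})}[\Gamma(m+1)] \le \frac{1}{BC}\left(\frac{1}{P_i} - 1 + B\right). \tag{40}
\]

Substituting Eq. 16 and Eq. 40 in Eq. 39 we obtain the following lower bound for $\Pr[Y_i]$:

\[
\Pr[Y_i] \ge 1 - \frac{1}{BC}\left(\frac{2}{P_i} + (B-1)\Big(1 + \sum_{j=1}^{k}\frac{p_j}{P_i+p_j}\Big)\right). \tag{41}
\]

Upper bound on p_c

Finally, substituting $\Pr[Y_i]$ from Eq. 41 in Eq. 9 we get:

\[
p_c \le \sum_{i=1}^{k/B} P_i\,\frac{1}{BC}\left(\frac{2}{P_i} + (B-1)\Big(1 + \sum_{j=1}^{k}\frac{p_j}{P_i+p_j}\Big)\right)
= \frac{2k}{B^2C} + \frac{B-1}{BC}\Big(1 + \sum_{i=1}^{k/B}\sum_{j=1}^{k}\frac{P_ip_j}{P_i+p_j}\Big).
\]

□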



4.4.3 Lower bound 

It is quite obvious that the lower bound for in-place permutation, given in Theorem 3, is a lower
bound for out-of-place permutation.

4.4.4 Upper and lower bounds for uniformly random data 

Using the upper bound Theorem just proven, we now derive a Corollary for an upper bound to 
the number of cache misses if the data is uniformly distributed. 

Corollary 3 If $p_1 = \ldots = p_k = 1/k$ then the number of cache misses in n rounds of Process
"out-of-place" is at most:

(\ k{B + 3) k k \ , / 1 
^B+^WC^ + Wc + Bcj+'V+B 

Proof. Since $P_i$ in $p_c$ and $P_j$ in $p_d$ are both $B/k$ in the equations in Theorem 5, we get that

v. + v+v < 2 2{B-l) k^ B/k [B-lf l/k 
Pd+Pc+Ps < ^+ BB + 1^ B^C 2 

B \ C2 
{B-lfk 



< 



2k 


2(B - l)fc 


^ B^C ^ 




2 2(5-1 k 
h -^^ 1 


B BC B+l 


2k 


2(B-l)k 


^ B^C ^ 




2 k 


■4 B-l' 


B^C 


B^ 2B 


2 ^ k{B + 7) ^ 2k 


B ^ 2BC ' 



B^C 2 
? - 12C- 
~B 02" 

2k 2 



B^C C 
2 



□

Remark: Corollaries 1 and 3 show that for uniformly distributed data, other than for small
values of B, the numbers of cache misses during in-place and out-of-place permutations are quite
close. As for an in-place permutation, one pass of uniform distribution sorting using out-of-place
permutations incurs O(n/B) cache misses if and only if k = O(C/B).

Using Corollary 2 for the lower bound and Corollary 3 above, we see that when $k \le C$ the
lower bound is again within a factor of 3/2 of the upper bound and is much closer when $k \ll C$.

4.5 Cache Analysis of Multiple Sequences Access 

Accessing k sequences is like Process "in-place" in Section 4.1.1 except that there is no interaction
with a count array, so we delete step 2 and assumption (c). An analogue of Theorem 1 is easily
obtained. An easy modification to the proof of Theorem 2 gives:

Theorem 6 The expected number of cache misses in n rounds of sequence accesses is at most: 

^ b^c + B^c y2^^p,+p^j- 

Corollary 4 If $p_1 = \ldots = p_k = 1/k$ then the number of cache misses in n rounds of sequence
accesses is at most:

\B 2BC J 

Remark: From Corollary 4, k = O(C/B) random sequences can be accessed incurring an optimal
O(n/B) misses. This essentially agrees with the results obtained by Mehlhorn and Sanders [9] and
Sen and Chatterjee [M].
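The behaviour described in this remark can be explored empirically. The following Python sketch is a minimal Monte Carlo simulation of n rounds of accesses to k sequences through a direct-mapped cache of C blocks holding B keys each, assuming uniform access probabilities and uniformly random starting offsets. The memory layout (each sequence gets its own block-aligned region of `span` words) and all names are hypothetical choices made for the illustration; they are not taken from the analysis.

```python
import random

def simulate_sequence_accesses(k, B, C, n, seed=0):
    """Count misses for n random accesses to k sequences in a direct-mapped cache."""
    rng = random.Random(seed)
    # Reserve a block-aligned region per sequence so that sequences never share a block.
    span = ((n + B * C) // B + 1) * B
    # Random starting offset within one cache-sized window (models assumption (a)).
    pos = [j * span + rng.randrange(B * C) for j in range(k)]
    cache = [None] * C                # cache[s] = memory block currently held in set s
    misses = 0
    for _ in range(n):
        j = rng.randrange(k)          # pick a sequence with probability 1/k
        block = pos[j] // B           # memory block holding the next element
        s = block % C                 # direct-mapped placement
        if cache[s] != block:         # cold or conflict miss
            misses += 1
            cache[s] = block
        pos[j] += 1                   # advance the sequence pointer
    return misses

# Example: miss rate per access for a few values of k with B = 8, C = 256.
for k in (4, 32, 256):
    print(k, simulate_sequence_accesses(k, 8, 256, 50000) / 50000)
```

With a single sequence the simulation reports only the compulsory misses (about n/B); as k grows towards C the conflict misses become visible, which is the regime the bounds above quantify.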

Remark: Since its derivation ignored the effects of the count array, the lower bound in Theorem [3] 
applies directly to sequence accesses. Note that the lower bound we obtain for uniformly random 
data, as stated in Corollary [51 is sharper than the lower bound of 0.25(1 — e^"-^^''"/'^) obtained 

in [g. 

Remark: Our upper and lower bounds are also closer than those in The analysis in [3] assumes 
that accesses to the sequences are controlled by an adversary; our analysis demonstrates, that with 
uniform randomised accesses to the sequences, more sequences can be accessed optimally. 

4.6 Correspondence between the processes and the permute phase 

We now show how the Processes "in-place" and "out-of-place" model the permute phase of a 
generic distribution sorting algorithm. 

The correspondence between Process "in-place" of Section 4.1.1 and the pseudocode in Figure 2
is as follows. Each iteration of the inner loop (steps 3.1-3.5) of the pseudocode corresponds to a
round of Process "in-place". The array COUNT corresponds to the locations C, and the pointer $D_i$
points to DATA[idx]. The variables x in the process and the pseudocode play a similar role. It can
easily be verified that in each iteration of the loop in the pseudocode, the value of x is any integer
1, ..., k with probability $p_1, \ldots, p_k$, independently of its previous values, as in Step 1 of Process
"in-place". A read at a location immediately followed by a write to the same location is counted
as one access. Thus, the read and increment of COUNT[x] in Steps 3.2 and 3.3 of the pseudocode
constitute one access, equivalent to Step 2. Similarly the "swap" in Step 3.5 of the pseudocode
corresponds to the memory access in Step 3 of the process. The process does not model the initial
access in Step 1 of the pseudocode, nor does it model the task of looking for new cycle leaders
in Steps 4 and 5 of the pseudocode.
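For readers without the pseudocode figure at hand, the following Python sketch shows one common way to organise an in-place permute phase with cycle leaders. It is an assumed reconstruction in the spirit of the description above, not the paper's pseudocode: `classify`, `nxt` and the loop structure are illustrative names and choices. Each pass of the inner `while` loop performs the three actions that one round of Process "in-place" models: compute the class x of the key in hand, read and advance the per-class position, and swap the key with the element at that position.

```python
def permute_in_place(data, classify, k):
    """Rearrange data so that keys of class 0 come first, then class 1, and so on."""
    count = [0] * k
    for key in data:                       # count phase
        count[classify(key)] += 1
    start, total = [0] * k, 0
    for c in range(k):                     # start[c] = first slot of class c
        start[c] = total
        total += count[c]
    end = [start[c] + count[c] for c in range(k)]
    nxt = start[:]                         # nxt[c] = next unfilled slot of class c
    for c in range(k):
        while nxt[c] < end[c]:             # look for the next cycle leader
            key = data[nxt[c]]
            x = classify(key)
            while x != c:                  # follow the cycle
                idx = nxt[x]               # read and increment the class position
                nxt[x] += 1
                data[idx], key = key, data[idx]   # swap the key into place
                x = classify(key)
            data[nxt[c]] = key             # close the cycle
            nxt[c] += 1
    return start

# Example: group 2000 keys from [0, 1) into 16 equal-width classes.
keys = [((i * 7919) % 10007) / 10007 for i in range(2000)]
data = list(keys)
permute_in_place(data, lambda v: int(v * 16), 16)
print([int(v * 16) for v in data[:5]])
```

After the call, the classes occupy consecutive regions of `data`, so each region can be sorted independently.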

The correspondence between Process "out-of-place" of Section 4.1.2 and the pseudocode in
Figure 1 is as follows. The array COUNT corresponds to the locations C, the array DATA corresponds
to the locations S, i in the pseudocode corresponds to s in the process, and the pointer $D_i$ points to
DEST[idx]. The increment of i in the pseudocode is equivalent to the increment of s in the process,
and the accesses to DATA[i] and S[s] are equivalent. As above, the variables x in the process and
in the pseudocode play a similar role. Again, the read and increment of COUNT[x]
in the pseudocode constitute one access, equivalent to Step 3 of the process. The access to
DEST[idx] in the pseudocode corresponds to the memory access in Step 4 of the process.

Assumption (b) of the processes is clearly satisfied and assumption (c) can normally be made
to hold. Assumption (d) and $k \le CB$ or $k \le C$ may not hold in practice; in [11] we give an
approximate analysis which deals with this. Assumption (a) of the processes, that the starting
locations of the pointers $D_i$ are uniformly and independently distributed, is patently false; we
discuss this in more detail in [11]. We may force it to hold by adding random offsets to the starting
location of each pointer, at the cost of needing more memory and adding a compaction phase
after the permute; this has also been suggested by Mehlhorn and Sanders [9]. This only works
if the permute is not in-place, and if k is sufficiently small (e.g. $k \le n/(CB)$). In [11] we study
assumption (a) empirically in the context of uniform distribution sorting. Another weakness is
that our processes are continuous, so the sequence lengths are not specified, whereas in distribution
sorting we sort n keys and each sequence is of a finite length.

5 MSB radix sort 

We now consider the problem of sorting n independent and uniformly-distributed floating-point
numbers in the range [0,1) using the integer sorting algorithm MSB radix sort. As noted earlier,
it suffices to sort lexicographically the bit-strings which represent the floats, by viewing them as
integers. One pass of MSB radix sort using radix size r groups the keys according to their most
significant r bits in $O(2^r + n)$ time. For random integers, a reasonable choice for minimising
instruction counts is $r = \lceil \log n - 3 \rceil$ bits, or classifying into about n/8 classes. Since each class
has about 8 keys on average, they can be sorted using insertion sort. Using this approach for this
problem gives terrible performance even at small values of n (see Table 1). As we now show, the
problem lies with the distribution of the integers on which MSB radix sort is applied.

5.1 Radix sorting floating-point numbers 

A floating-point number is represented as a triple of non-negative integers (i, j, l). Here i is called
the sign bit and is a 0-1 value (0 indicating non-negative numbers, 1 indicating negative numbers),
j is called the exponent and is represented using e bits and l is called the mantissa and is represented
using m bits. Let $j^* = j - 2^{e-1} + 1$ denote the unbiased exponent of (i, j, l). Roughly following
the IEEE 754 standard, let the triple (0, 0, 0) represent the number 0, and let (i, j, l), where j > 0,
represent the number $\pm 2^{j^*}(1 + l\,2^{-m})$, depending on whether i = 0 or 1; no other triple is a
floating-point number. Internally each member of the triple is stored in consecutive fields of a
word. The IEEE 754 standard specifies e = 8 and m = 23 for 32-bit floats and e = 11 and m = 52
for 64-bit floats [J. 
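The triple (i, j, l) can be read directly off the bit pattern of a float. The following Python sketch, using the standard `struct` module, decodes a 32-bit float into its sign, exponent and mantissa fields and illustrates that, for non-negative floats, comparing the bit patterns as unsigned integers agrees with comparing the values. The helper names are our own.

```python
import struct

def float32_bits(x):
    """Bit pattern of x rounded to a 32-bit IEEE 754 float, as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def decode_float32(x):
    """Return the triple (sign i, biased exponent j, mantissa l) for e = 8, m = 23."""
    bits = float32_bits(x)
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

def unbiased_exponent(j, e=8):
    """j* = j - 2^(e-1) + 1, as defined in the text."""
    return j - 2 ** (e - 1) + 1

# Example: 0.75 = 2^-1 * (1 + 2^-1), so j* = -1 and only the top mantissa bit is set.
i, j, l = decode_float32(0.75)
print(i, unbiased_exponent(j), l == 1 << 22)
```

Every float in [0.5, 1) has the same exponent field, which is why half of the keys share their leading e + 1 bits.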

We model the generation of a random float in the range [0,1) as follows: generate an (infinite-precision)
random real number, and round it down to the next smaller float. On average, half
the numbers generated will lie in the range [0.5, 1) and will have an unbiased exponent of -1.
In general, for all non-zero numbers, the unbiased exponent has value $i$ with probability $2^{i}$, for
$i = -1, -2, \ldots, -2^{e-1} + 2$, whereas the mantissa is a random m-bit integer. The value 0 has
probability $2^{-2^{e-1}+2}$. Clearly, the distribution is not uniform, and it is easy to see that the average
size of the largest class after the first pass of MSB radix sort with radix r is $n\left(1 - 2^{-(2^{e-r+1}-1)}\right)$ if
$r < e + 1$, and $n/2^{r-e}$ if $r \ge e + 1$.
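The size of the largest class can be checked by exact enumeration. The following Python sketch builds the exponent distribution described above using exact rational arithmetic, groups the keys by their leading r bits, and compares the largest class probability with the closed forms $1 - 2^{-(2^{e-r+1}-1)}$ for $r < e+1$ and $2^{-(r-e)}$ for $r \ge e+1$. The function names are hypothetical, and the closed form for $r < e+1$ is our reading of the expression in the text; the comparison is made for $3 \le r \le e$, where the largest class does not also contain the value 0.

```python
from fractions import Fraction

def exponent_probabilities(e):
    """Map biased exponent j -> probability, for random floats in [0, 1)."""
    bias = 2 ** (e - 1) - 1
    probs = {0: Fraction(1, 2 ** (bias - 1))}        # the value 0
    for j in range(1, bias):                          # unbiased exponent j - bias <= -1
        probs[j] = Fraction(1, 2 ** (bias - j))       # probability 2^(j - bias)
    return probs

def largest_class_fraction(r, e):
    """Probability of the most likely class when keys are grouped by their top r bits."""
    probs = exponent_probabilities(e)
    if r >= e + 1:
        split = 2 ** (r - e - 1)                      # mantissa prefixes per exponent
        best = max(p / split for j, p in probs.items() if j > 0)
        return max(best, probs[0])                    # zero has a single mantissa value
    shift = e - (r - 1)                               # exponent bits not examined
    groups = {}
    for j, p in probs.items():
        groups[j >> shift] = groups.get(j >> shift, 0) + p
    return max(groups.values())

def closed_form(r, e):
    if r < e + 1:
        return 1 - Fraction(1, 2 ** (2 ** (e - r + 1) - 1))
    return Fraction(1, 2 ** (r - e))

# Example: single precision (e = 8) with radix 16.
print(largest_class_fraction(16, 8))
```

For e = 8 and r = 16 the largest class holds a fraction 1/256 of the keys, far more than the roughly $2^{-16}$ one would expect for uniformly distributed integers.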

This shows, e.g., that the largest sub-problems in the examples of Table 1 would be of size
$n/2^{\lceil \log n - 3 \rceil - 11} \approx 2^{14}$, so using insertion sort after one pass is inefficient in this case (see the footnote below). To get down
to problems of size 8 in one pass requires a radix of about $\log n + 8$, which is impractical. Also,
MSB radix sort applied to random integers has O(n) expected running time independently of the
word size, but this is not true for floats. A first pass with $r \le e$ barely reduces the largest problem
size, and the same holds for subsequent passes until bits from the mantissa are reached. As the
radix in any pass is limited to $\log n + O(1)$ bits, we may need $\Omega(e/\log n)$ passes, introducing a
dependence on the word size.

Footnote: In fact, the total number of keys in all sub-problems of this size would be n/2 on average.



5.2 Using Quicksort 

To get around the problem of having several passes before we reduce the largest class, we partition
the input keys around a value $1/n \le \theta \le 1/(\log n)$, and sort the keys smaller than $\theta$ in O(n)
expected time using Quicksort. We then apply MSB radix sort to the remaining keys. Let $e' =
\min\{\lceil \log\log(1/\theta) \rceil, e\}$ denote the effective exponent, since the remaining keys have exponents which
vary only in the lower order $e'$ bits. This means that keys can be grouped according to a radix
$r = e + 1 + m'$ with $m' \ge 0$ in $O(n + 2^{e'+m'})$ time and $O(2^{e'+m'})$ space. Since $e' = O(\log\log n)$,
we can take up to $\log n - O(\log\log n)$ bits from the mantissa as part of the first radix; as all sub-problems
now only deal with bits from the mantissa they can be solved in linear expected time,
giving a linear running time overall.
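The scheme just described can be sketched end to end. The following Python code is a minimal illustration under stated assumptions, not the tuned implementation: it works on 64-bit doubles (e = 11, m = 52) because those are Python's native floats, uses the built-in `sorted` as a stand-in for Quicksort on the keys below the threshold, picks the threshold at the upper end of the allowed range ($\theta = 1/\log n$), and finishes with a single insertion-sort sweep rather than sorting each class separately. All function and parameter names are our own.

```python
import math
import struct

E_BITS = 11                                   # exponent width of an IEEE 754 double

def bits_of(x):
    """Bit pattern of a non-negative double, viewed as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def insertion_sort(a):
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def sort_uniform_floats(keys, m_prime=8):
    """Sort floats from [0, 1): partition at theta, then one MSB radix pass."""
    n = len(keys)
    if n < 4:
        return sorted(keys)
    theta = 1.0 / math.log2(n)                # assumed threshold, 1/n <= theta <= 1/log n
    small = sorted(x for x in keys if x < theta)      # stand-in for Quicksort
    large = [x for x in keys if x >= theta]
    shift = 64 - (1 + E_BITS + m_prime)       # keep the top r = e + 1 + m' bits
    lo = bits_of(theta) >> shift              # smallest class index that can occur
    hi = bits_of(1.0) >> shift                # keys are below 1.0, so classes are <= hi
    count = [0] * (hi - lo + 2)               # only the classes actually in use
    for x in large:                           # count phase
        count[(bits_of(x) >> shift) - lo + 1] += 1
    for c in range(1, len(count)):            # prefix sums: count[c] = start of class c
        count[c] += count[c - 1]
    dest = [0.0] * len(large)
    for x in large:                           # out-of-place permute phase
        c = (bits_of(x) >> shift) - lo
        dest[count[c]] = x
        count[c] += 1
    insertion_sort(dest)                      # classes are in order; fix up within classes
    return small + dest

# Example usage on a deterministic pseudo-random input.
keys = [((i * 7919) % 10007) / 10007 for i in range(5000)]
print(sort_uniform_floats(keys)[:3])
```

Note that the count array has only `hi - lo + 2` entries rather than $2^r$: after the partition, the exponent field of the remaining keys varies only in its low-order bits, which is the point of the effective exponent $e'$.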



5.3 Cache analysis 

We now use our analysis to calculate an upper bound for the cache misses in the permute phase of
the first pass of MSB radix sort using a radix $r = e + 1 + m'$, for some $m' \ge 0$, assuming also that
all keys are in the range $[\theta, 1)$, for some $\theta \ge 1/n$. There are $2^{e'+m'}$ pointers in all, which can be
divided into $g = 2^{e'}$ groups of $K = 2^{m'}$ pointers each. Group $i$ corresponds to keys with unbiased
exponent $-i$, for $i = 1, \ldots, g$. All pointers in group $i$ have an access probability of $1/(K 2^{i})$. Using
Theorem 1 and a slight extension of the methods of Theorem 2 we are able to prove Theorem 7
below, which states that the number of misses is essentially independent of $g$:

Theorem 7 Provided $gK \le CB$ and $K \le C$ the number of misses in the first pass of the permute
phase of MSB radix sort is at most:

" ^ i§ ^^'^^ + 2 logs + logC - logK + 0.7)^ + gK{l + 1/B). 
Proof. Using Eq. [iniwe can calculate an upper boimd on the probability of event Ar(i_i)x+i as: 



- BC C ^ ^ B2-3 + 2-* BC 2-3 + 2"* 

■j=i 1=1 j=i 1=1 

= 1-;^ Es2?T^ + 2^ + (^-i)E2^rT^ 

> ^-^(^^sB + ^ + T + {B-l)j2^^^^^\. (42) 

If K2^/{BC) > 1 then Pr[X(i_i)7f+i] would be negative, so we place a bound on this term such 
that KT < BC. The maximum value of i such that K2y{BC) < 1 is logSC - log X - 1. 

Since the probabilities of access to pointer . . . , -Djx are all 1/{K2''-) we can calculate 

an upper bound on pd using Eg. and 1421 as: 



pa < ^{Y.P^{^ogB + ^ + {B-l)Y^^—^]+ ^ p^' 
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n — 


n          1 × 10^6   2 × 10^6   4 × 10^6   8 × 10^6   16 × 10^6   32 × 10^6
MTQuick    0.7400     1.5890     3.3690     7.2430     15.298      32.092
Naive1     7.0620     14.192     28.436     57.082     115.16      233.16

Table 1: Memory-tuned Quicksort and Naive1 MSBRadix. Running times in seconds of memory-tuned
Quicksort and Naive1 MSBRadix sort (single pass MSBRadix sort without partitioning,
$r = \lceil \log n - 3 \rceil$) on floating point keys.
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i=loeBC-logK 



1 K 



2 -J 



i=l 

logBC-logK-1 



2^ BC ^ 2* 

i=\ogBC~\ogK 



^ 1 

F - 



1=1 

9 



SC I ^ 2 

2=1 



i=l 



2« ^ + 2- 

i=i j=i 



2K 

+ —{logBC-logK-l) + — 
< (2 logs + 3 + logC- log + 2.3(5- 1)). 



□ 



5.4 Tuning MSB radix sort 

We now optimise parameter choices in our algorithms. The smaller the value of $\theta$, the fewer keys
are sorted by Quicksort, but reducing $\theta$ may increase $e'$. A larger value of $e'$ does not mean
more misses, by Theorem 7, but it does mean a larger count array. We choose $\theta = 1/(\log n)^2$ as a
compromise, ensuring that Quicksort uses o(n) time. Using the above analysis we are also able to
determine an optimal number of classes to use in each sorting sub-problem. We use two criteria
of optimality. In the first, we require that each pass incur no more than $(2 + \epsilon)n/B$ misses for
some constant $\epsilon > 0$, thus seeking essentially to minimise cache misses ($2n/B$ misses is the bare
minimum for the count and permute phases). In the second, we trade off reductions in cache misses
against extra computation. The latter yields better practical results, and results shown below are
for this approach.

5.5 Experimental results 

Table 2 compares tuned MSB radix sort with memory-tuned Quicksort [8] and MPFlashsort [11], a
memory-tuned version of a distribution sorting algorithm which assumes that the keys are independently
drawn from a uniformly random distribution. The algorithms were coded in C and compiled
using gcc 2.8.1. The experiments were run on our Sun UltraSparc-II with 2 × 300 MHz processors,
1GB main memory, a 16KB L1 data cache and a 512KB L2 direct-mapped cache. Observe that
MSB radix sort easily outperforms the other algorithms for the range of values considered.

n          1 × 10^6   2 × 10^6   4 × 10^6   8 × 10^6   16 × 10^6   32 × 10^6   64 × 10^6
MPFlash    0.6780     1.3780     2.2756     6.1700     13.308      27.738      56.796
MTQuick    0.7400     1.5890     3.3690     7.2430     15.298      32.092      67.861
MSBRadix   0.3865     0.8470     1.9820     5.0300     9.4800      19.436      40.663

Table 2: MPFlashsort, memory-tuned Quicksort and MSBRadix. Running times in seconds of
MPFlashsort, memory-tuned Quicksort and MSBRadix sort on a Sun UltraSparc-II using single
precision floating point keys.

6 Conclusions

We have analysed the average-case cache performance of the permute phase of distribution sorting
when the keys are independently but not necessarily uniformly distributed. We have presented equations for
the number of misses during in-place and out-of-place permutations and have given closed-form
upper and lower bounds on these. We have shown that the upper and lower bounds are quite close
when $k \le C$ and the data is known to be independently and uniformly distributed. We have shown
how this analysis can easily be extended to obtain the number of cache misses during accesses to
multiple sequences.

We have shown that if the integer sorting algorithm MSB radix sort is used to sort uniformly
and randomly distributed floating point numbers then a non-uniform distribution of keys to classes
is induced. We have shown that a naive implementation of this algorithm would have very poor
performance due to this non-uniform distribution. We have shown that by partitioning the keys, to
remove keys which are expected to go into small classes, and by using our analysis, the algorithm
can be tuned for good cache performance. Due to fast integer operations and good cache utilisation
the tuned algorithm outperforms MPFlashsort, a cache-tuned distribution sorting algorithm, and
memory-tuned Quicksort.
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